1. Predictive model for a binary classification.

1.1. Data Input

1.1.1 In

##        X              age                 job            marital     
##  Min.   :    1   Min.   :17.00   admin.     :10422   divorced: 4612  
##  1st Qu.:10298   1st Qu.:32.00   blue-collar: 9254   married :24928  
##  Median :20595   Median :38.00   technician : 6743   single  :11568  
##  Mean   :20595   Mean   :40.02   services   : 3969   unknown :   80  
##  3rd Qu.:30891   3rd Qu.:47.00   management : 2924                   
##  Max.   :41188   Max.   :98.00   retired    : 1720                   
##                                  (Other)    : 6156                   
##                education        default         housing     
##  university.degree  :12168   no     :32588   no     :18622  
##  high.school        : 9515   unknown: 8597   unknown:  990  
##  basic.9y           : 6045   yes    :    3   yes    :21576  
##  professional.course: 5243                                  
##  basic.4y           : 4176                                  
##  basic.6y           : 2292                                  
##  (Other)            : 1749                                  
##       loan            contact          month       day_of_week 
##  no     :33950   cellular :10505   may    : 5536   fri : 3069  
##  unknown:  990   telephone: 5971   jul    : 2842   mon : 3421  
##  yes    : 6248   NA's     :24712   aug    : 2443   thu : 3491  
##                                    jun    : 2126   tue : 3207  
##                                    nov    : 1653   wed : 3288  
##                                    (Other): 1876   NA's:24712  
##                                    NA's   :24712               
##     duration         campaign         pdays          previous    
##  Min.   :   0.0   Min.   : 1.00   Min.   :  0.0   Min.   :0.000  
##  1st Qu.: 104.0   1st Qu.: 1.00   1st Qu.:999.0   1st Qu.:0.000  
##  Median : 182.0   Median : 2.00   Median :999.0   Median :0.000  
##  Mean   : 259.7   Mean   : 2.58   Mean   :962.5   Mean   :0.173  
##  3rd Qu.: 320.0   3rd Qu.: 3.00   3rd Qu.:999.0   3rd Qu.:0.000  
##  Max.   :4918.0   Max.   :43.00   Max.   :999.0   Max.   :7.000  
##  NA's   :24712    NA's   :24712                                  
##         poutcome      emp.var.rate      cons.price.idx  cons.conf.idx  
##  failure    : 4252   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8  
##  nonexistent:35563   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7  
##  success    : 1373   Median : 1.10000   Median :93.75   Median :-41.8  
##                      Mean   : 0.08189   Mean   :93.58   Mean   :-40.5  
##                      3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4  
##                      Max.   : 1.40000   Max.   :94.77   Max.   :-26.9  
##                                         NA's   :113                    
##    euribor3m      nr.employed     y              test_control_flag
##  Min.   :0.634   Min.   :4964   no :37048   campaign group:16476  
##  1st Qu.:1.344   1st Qu.:5099   yes: 4140   control group :24712  
##  Median :4.857   Median :5191                                     
##  Mean   :3.621   Mean   :5167                                     
##  3rd Qu.:4.961   3rd Qu.:5228                                     
##  Max.   :5.045   Max.   :5228                                     
## 
  • Strictly speaking, “normalization” parameters should be estimated on the train dataset and then re-applied to all other data subsets (test, control and prod). Here it is applied to all data at once, for simplicity.
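The stricter procedure can be sketched as follows. This is a minimal illustration in Python, not the R code used for this report. The helper names `fit_scaler` and `apply_scaler` and the toy ages are assumptions made for the example.

```python
from statistics import mean, stdev

def fit_scaler(train_values):
    """Estimate centering/scaling parameters from the training sample only."""
    return {"mean": mean(train_values), "sd": stdev(train_values)}

def apply_scaler(values, scaler):
    """Reuse the training parameters on any other subset (test, control, prod)."""
    return [(v - scaler["mean"]) / scaler["sd"] for v in values]

# Toy data (hypothetical ages, not taken from the report's data set)
train_age = [30.0, 40.0, 50.0]
test_age = [40.0, 60.0]

age_scaler = fit_scaler(train_age)                 # parameters come from train only
test_scaled = apply_scaler(test_age, age_scaler)   # test rows never influence them
print(test_scaled)
```

Fitting on the training rows only keeps information from the test and control subsets out of the model inputs, which is the leakage the bullet above warns about.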

1.1.2 data split

  • campaign_train - main exploratory and model building sample
  • campaign_test - “blind” sample, used ONLY for final model performance evaluation
##       contact          month       day_of_week     duration    
##  cellular :    0   apr    :    0   fri :    0   Min.   : NA    
##  telephone:    0   aug    :    0   mon :    0   1st Qu.: NA    
##  NA's     :24712   dec    :    0   thu :    0   Median : NA    
##                    jul    :    0   tue :    0   Mean   :NaN    
##                    jun    :    0   wed :    0   3rd Qu.: NA    
##                    (Other):    0   NA's:24712   Max.   : NA    
##                    NA's   :24712                NA's   :24712  
##     campaign    
##  Min.   : NA    
##  1st Qu.: NA    
##  Median : NA    
##  Mean   :NaN    
##  3rd Qu.: NA    
##  Max.   : NA    
##  NA's   :24712
  • The predictors listed above (contact, month, day_of_week, duration, campaign) are entirely NA in the control group and are therefore considered for removal from the model.
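One way to find such predictors automatically is to look for columns that are missing in every control-group row. The sketch below is in Python rather than the report's R. The function name `all_missing_columns` and the toy rows are hypothetical, and `None` stands in for R's `NA`.

```python
def all_missing_columns(rows):
    """Return the names of columns whose value is None in every row."""
    if not rows:
        return []
    return [col for col in rows[0] if all(r[col] is None for r in rows)]

# Toy control-group rows (hypothetical values, not the report's data)
control_rows = [
    {"age": 41, "contact": None, "duration": None, "poutcome": "failure"},
    {"age": 35, "contact": None, "duration": None, "poutcome": "nonexistent"},
]

dropped = all_missing_columns(control_rows)
print(dropped)
```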

1.2 Data Visualisation

aka qualitative analysis

1.2.1 Factors

1.2.1.1 Bar Plots
[Bar plots, one per factor variable: job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome.]

1.2.1.2 Crosstables
## [1] "job"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14828 
## 
##  
##                       | campaign_train[, a] 
## campaign_train[, "y"] |        admin. |   blue-collar |  entrepreneur |     housemaid |    management |       retired | self-employed |      services |       student |    technician |    unemployed |       unknown |     Row Total | 
## ----------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
##                    no |          3234 |          3072 |           505 |           326 |           944 |           487 |           438 |          1344 |           221 |          2172 |           297 |           118 |         13158 | 
##                       |         0.867 |         0.930 |         0.932 |         0.886 |         0.891 |         0.768 |         0.892 |         0.919 |         0.682 |         0.892 |         0.856 |         0.908 |               | 
## ----------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
##                   yes |           495 |           232 |            37 |            42 |           116 |           147 |            53 |           119 |           103 |           264 |            50 |            12 |          1670 | 
##                       |         0.133 |         0.070 |         0.068 |         0.114 |         0.109 |         0.232 |         0.108 |         0.081 |         0.318 |         0.108 |         0.144 |         0.092 |               | 
## ----------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
##          Column Total |          3729 |          3304 |           542 |           368 |          1060 |           634 |           491 |          1463 |           324 |          2436 |           347 |           130 |         14828 | 
##                       |         0.251 |         0.223 |         0.037 |         0.025 |         0.071 |         0.043 |         0.033 |         0.099 |         0.022 |         0.164 |         0.023 |         0.009 |               | 
## ----------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
## 
##  
## [1] "marital"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14828 
## 
##  
##                       | campaign_train[, a] 
## campaign_train[, "y"] |  divorced |   married |    single |   unknown | Row Total | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|
##                    no |      1479 |      8055 |      3599 |        25 |     13158 | 
##                       |     0.897 |     0.900 |     0.856 |     0.862 |           | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|
##                   yes |       170 |       892 |       604 |         4 |      1670 | 
##                       |     0.103 |     0.100 |     0.144 |     0.138 |           | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|
##          Column Total |      1649 |      8947 |      4203 |        29 |     14828 | 
##                       |     0.111 |     0.603 |     0.283 |     0.002 |           | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|
## 
##  
## [1] "education"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14828 
## 
##  
##                       | campaign_train[, a] 
## campaign_train[, "y"] |            basic.4y |            basic.6y |            basic.9y |         high.school |          illiterate | professional.course |   university.degree |             unknown |           Row Total | 
## ----------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
##                    no |                1294 |                 767 |                2011 |                3083 |                   3 |                1708 |                3767 |                 525 |               13158 | 
##                       |               0.899 |               0.926 |               0.924 |               0.891 |               0.600 |               0.894 |               0.859 |               0.850 |                     | 
## ----------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
##                   yes |                 146 |                  61 |                 166 |                 379 |                   2 |                 203 |                 620 |                  93 |                1670 | 
##                       |               0.101 |               0.074 |               0.076 |               0.109 |               0.400 |               0.106 |               0.141 |               0.150 |                     | 
## ----------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
##          Column Total |                1440 |                 828 |                2177 |                3462 |                   5 |                1911 |                4387 |                 618 |               14828 | 
##                       |               0.097 |               0.056 |               0.147 |               0.233 |               0.000 |               0.129 |               0.296 |               0.042 |                     | 
## ----------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
## 
##  
## [1] "default"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14828 
## 
##  
##                       | campaign_train[, a] 
## campaign_train[, "y"] |        no |   unknown |       yes | Row Total | 
## ----------------------|-----------|-----------|-----------|-----------|
##                    no |     10225 |      2932 |         1 |     13158 | 
##                       |     0.871 |     0.950 |     1.000 |           | 
## ----------------------|-----------|-----------|-----------|-----------|
##                   yes |      1515 |       155 |         0 |      1670 | 
##                       |     0.129 |     0.050 |     0.000 |           | 
## ----------------------|-----------|-----------|-----------|-----------|
##          Column Total |     11740 |      3087 |         1 |     14828 | 
##                       |     0.792 |     0.208 |     0.000 |           | 
## ----------------------|-----------|-----------|-----------|-----------|
## 
##  
## [1] "housing"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14828 
## 
##  
##                       | campaign_train[, a] 
## campaign_train[, "y"] |        no |   unknown |       yes | Row Total | 
## ----------------------|-----------|-----------|-----------|-----------|
##                    no |      5952 |       342 |      6864 |     13158 | 
##                       |     0.889 |     0.905 |     0.885 |           | 
## ----------------------|-----------|-----------|-----------|-----------|
##                   yes |       745 |        36 |       889 |      1670 | 
##                       |     0.111 |     0.095 |     0.115 |           | 
## ----------------------|-----------|-----------|-----------|-----------|
##          Column Total |      6697 |       378 |      7753 |     14828 | 
##                       |     0.452 |     0.025 |     0.523 |           | 
## ----------------------|-----------|-----------|-----------|-----------|
## 
##  
## [1] "loan"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14828 
## 
##  
##                       | campaign_train[, a] 
## campaign_train[, "y"] |        no |   unknown |       yes | Row Total | 
## ----------------------|-----------|-----------|-----------|-----------|
##                    no |     10820 |       342 |      1996 |     13158 | 
##                       |     0.885 |     0.905 |     0.897 |           | 
## ----------------------|-----------|-----------|-----------|-----------|
##                   yes |      1404 |        36 |       230 |      1670 | 
##                       |     0.115 |     0.095 |     0.103 |           | 
## ----------------------|-----------|-----------|-----------|-----------|
##          Column Total |     12224 |       378 |      2226 |     14828 | 
##                       |     0.824 |     0.025 |     0.150 |           | 
## ----------------------|-----------|-----------|-----------|-----------|
## 
##  
## [1] "contact"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14828 
## 
##  
##                       | campaign_train[, a] 
## campaign_train[, "y"] |  cellular | telephone | Row Total | 
## ----------------------|-----------|-----------|-----------|
##                    no |      8068 |      5090 |     13158 | 
##                       |     0.853 |     0.948 |           | 
## ----------------------|-----------|-----------|-----------|
##                   yes |      1392 |       278 |      1670 | 
##                       |     0.147 |     0.052 |           | 
## ----------------------|-----------|-----------|-----------|
##          Column Total |      9460 |      5368 |     14828 | 
##                       |     0.638 |     0.362 |           | 
## ----------------------|-----------|-----------|-----------|
## 
##  
## [1] "month"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14828 
## 
##  
##                       | campaign_train[, a] 
## campaign_train[, "y"] |       apr |       aug |       dec |       jul |       jun |       mar |       may |       nov |       oct |       sep | Row Total | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                    no |       792 |      1924 |        38 |      2335 |      1709 |        89 |      4687 |      1333 |       130 |       121 |     13158 | 
##                       |     0.798 |     0.883 |     0.567 |     0.917 |     0.894 |     0.486 |     0.934 |     0.899 |     0.546 |     0.573 |           | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                   yes |       201 |       255 |        29 |       210 |       202 |        94 |       331 |       150 |       108 |        90 |      1670 | 
##                       |     0.202 |     0.117 |     0.433 |     0.083 |     0.106 |     0.514 |     0.066 |     0.101 |     0.454 |     0.427 |           | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##          Column Total |       993 |      2179 |        67 |      2545 |      1911 |       183 |      5018 |      1483 |       238 |       211 |     14828 | 
##                       |     0.067 |     0.147 |     0.005 |     0.172 |     0.129 |     0.012 |     0.338 |     0.100 |     0.016 |     0.014 |           | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
##  
## [1] "day_of_week"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14828 
## 
##  
##                       | campaign_train[, a] 
## campaign_train[, "y"] |       fri |       mon |       thu |       tue |       wed | Row Total | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##                    no |      2442 |      2792 |      2762 |      2545 |      2617 |     13158 | 
##                       |     0.895 |     0.908 |     0.872 |     0.880 |     0.882 |           | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##                   yes |       287 |       282 |       404 |       348 |       349 |      1670 | 
##                       |     0.105 |     0.092 |     0.128 |     0.120 |     0.118 |           | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##          Column Total |      2729 |      3074 |      3166 |      2893 |      2966 |     14828 | 
##                       |     0.184 |     0.207 |     0.214 |     0.195 |     0.200 |           | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
##  
## [1] "poutcome"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14828 
## 
##  
##                       | campaign_train[, a] 
## campaign_train[, "y"] |     failure | nonexistent |     success |   Row Total | 
## ----------------------|-------------|-------------|-------------|-------------|
##                    no |        1317 |       11677 |         164 |       13158 | 
##                       |       0.863 |       0.912 |       0.331 |             | 
## ----------------------|-------------|-------------|-------------|-------------|
##                   yes |         209 |        1129 |         332 |        1670 | 
##                       |       0.137 |       0.088 |       0.669 |             | 
## ----------------------|-------------|-------------|-------------|-------------|
##          Column Total |        1526 |       12806 |         496 |       14828 | 
##                       |       0.103 |       0.864 |       0.033 |             | 
## ----------------------|-------------|-------------|-------------|-------------|
## 
## 
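The “N / Col Total” cells in the crosstables above are column-wise proportions. A minimal Python sketch (not the R call that produced the tables) recomputes them for “poutcome” from the printed counts. The function name `column_proportions` is an assumption made for the example.

```python
def column_proportions(counts):
    """counts: {row_label: {col_label: n}} -> same shape holding n / column total."""
    col_totals = {}
    for row in counts.values():
        for col, n in row.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        r: {c: n / col_totals[c] for c, n in row.items()}
        for r, row in counts.items()
    }

# Counts copied from the "poutcome" crosstable above
poutcome = {
    "no": {"failure": 1317, "nonexistent": 11677, "success": 164},
    "yes": {"failure": 209, "nonexistent": 1129, "success": 332},
}

props = column_proportions(poutcome)
yes_rates = {c: round(p, 3) for c, p in props["yes"].items()}
print(yes_rates)
```

The “yes” share per level is the quantity compared across levels when deciding which factors to keep.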
1.2.1.3 Barplots crossed
[Crossed bar plots, one for each ordered pair of the factor variables job, marital, education, default, housing, loan, contact, month, day_of_week and poutcome (for example “job against marital”).]

1.2.2 Continuous Variables

Numeric variables

[Plots, one per numeric variable: euribor3m, nr.employed, cons.price.idx, cons.conf.idx, previous, pdays, campaign, duration, age.]

1.2.3 Conclusions

Factor variables

  • “job” - stays in
  • “marital status” - stays in
  • “education” - stays in, but dropping the “illiterate” level because of its low count (less than 0.04%)
  • “default” - stays in, but dropping the “yes” level (less than 0.03%)
  • “housing” - not to be used for modeling because of low variation in the “yes” rate across its levels
  • “loan” - as above
  • “poutcome” - stays in

  • Contact info - dropping all, because these are unknown at prediction time in our use case:
    • “contact”
    • “month”
    • “day_of_week”

Continuous variables

  • “euribor3m” - stays in. High variability at levels close to 0 and close to 1
  • “nr.employed” - stays in
  • “cons.price.idx” - stays in. Very different proportions depending on x value
  • “cons.conf.idx” - stays in
  • “previous” - 2 or more guarantees “yes”; let’s change its levels to 0, 1, 2+
  • “pdays” - not to be used for modeling. More than 90% is currently NA (the 999 placeholder seen in the summary in 1.1.1). If the number of days since the previous contact were available for more rows, it might be quite a valuable variable.
  • “age” - stays in

  • Contact info - dropping all, because these are unknown at prediction time in our use case:
    • “campaign”
    • “duration”
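The re-levelling proposed for “previous” can be sketched as below. This is a Python illustration, not the report's R code, and the function name `bin_previous` is hypothetical.

```python
def bin_previous(n):
    """Collapse the count of previous contacts into the levels "0", "1" and "2+"."""
    if n < 0:
        raise ValueError("number of previous contacts cannot be negative")
    return str(n) if n < 2 else "2+"

# The summary in 1.1.1 shows "previous" ranging from 0 to 7
levels = [bin_previous(n) for n in [0, 1, 2, 7]]
print(levels)
```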

1.3 Model building

Goal: a binary (yes/no) classification model. The qualitative analysis also shows high class imbalance, so the accuracy metric should NOT be used. Area under the ROC curve (AUC) will be used for model selection.

Approach:

  1. training with 5-fold cross-validation on the training sample
  2. model choice using the testing sample
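To illustrate why accuracy is misleading here, the Python sketch below (not the report's R code) uses the class counts from the training crosstables (13158 “no”, 1670 “yes”). A model that always answers “no” reaches high accuracy, while its AUC stays at the chance level of 0.5. The `auc` helper is an assumed pairwise-comparison implementation written for this example, and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Class counts taken from the training crosstables in 1.2.1.2
n_no, n_yes = 13158, 1670
majority_accuracy = n_no / (n_no + n_yes)   # accuracy of always predicting "no"

# Small toy sample (hypothetical) to keep the pairwise comparison cheap
toy_labels = [0, 0, 0, 1]
auc_constant = auc([0.0, 0.0, 0.0, 0.0], toy_labels)   # uninformative scores
auc_perfect = auc([0.1, 0.2, 0.8, 0.9], toy_labels)    # positive ranked on top

print(round(majority_accuracy, 3), auc_constant, auc_perfect)
```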

1.3.1 Model #1 - Decision Tree

## CART 
## 
## 14822 samples
##    14 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 11858, 11857, 11858, 11858, 11857, 11857, ... 
## Resampling results across tuning parameters:
## 
##   cp           ROC        Sens       Spec     
##   0.004196643  0.7053988  0.9896152  0.1900394
##   0.005095923  0.7050815  0.9907251  0.1792527
##   0.055455635  0.6193921  0.9945416  0.0978400
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.004196643.

1.3.2 Model #2 - SVM

(radial)

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 14820 samples
##    14 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 11857, 11856, 11856, 11855, 11856, 11856, ... 
## Resampling results across tuning parameters:
## 
##   C     ROC        Sens       Spec      
##   0.25  0.6458148  0.9960003  0.02482063
##   0.50  0.6464443  0.9916052  0.05910162
##   1.00  0.6386480  0.9917727  0.05177789
## 
## Tuning parameter 'sigma' was held constant at a value of 6.095125e-07
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 6.095125e-07 and C = 0.5.

1.3.3 Model #3 - Random Forest

## Random Forest 
## 
## 14820 samples
##    14 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 11855, 11857, 11856, 11857, 11855 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.7742432  0.9916362  0.1588678
##   17    0.7716569  0.9655567  0.2967524
##   33    0.7673464  0.9649483  0.2889572
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

1.3.4 Model #4 - GBM

## Stochastic Gradient Boosting 
## 
## 14820 samples
##    14 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 11856, 11856, 11855, 11857, 11856, 11855, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  ROC        Sens       Spec     
##   1                   50      0.7812019  0.9897810  0.1875284
##   1                  100      0.7879244  0.9880019  0.2043114
##   1                  150      0.7883184  0.9871654  0.2115060
##   2                   50      0.7875218  0.9879106  0.2055137
##   2                  100      0.7898940  0.9862074  0.2244520
##   2                  150      0.7900628  0.9856904  0.2322472
##   3                   50      0.7885373  0.9870438  0.2184532
##   3                  100      0.7907410  0.9855079  0.2350031
##   3                  150      0.7910849  0.9843674  0.2371660
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

1.3.5. Choosing the best Model

1.4 Model Summary

Random Forest shows the highest Area Under Curve, so it is chosen for the uplift calculation.

2. Uplift

Uplift = increase in probability of “buying” due to the campaign action.

Approach:

  1. use the model trained on the campaign set, so that its predicted probabilities assume the subject (row) was exposed to the campaign
  2. use the random forest model to predict probabilities on the control set data, simulating the application of the campaign to the control set

2.1. Data Preparation

Rows were removed in the campaign set to match the training sample structure. In a real-life application the prediction would be “0” (or “no”) for these rows, to avoid an over-optimistic uplift estimate; here this is omitted for simplicity.

2.2 Base average probability of buying

Calculated as the simple ratio of “yes” outcomes in the control group.

The calculated average base probability of ‘term deposit’ is 0.0928171.

2.3 Uplift calc

We take 16476 rows to match the number of calls made in the reference campaign. This represents applying the same campaign cost to a new campaign. In other words, we choose the 16476 most promising future customers/buyers.

Obtained uplift = 0.0885294, calculated as the mean of per-row differences between the predicted and base probability of “term deposit” for the top 16476 predicted probabilities.
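The calculation described above can be sketched in Python as follows (the report itself was produced in R). The function name `uplift_top_n` and the toy predictions are assumptions made for the example. Only the base rate is taken from section 2.2.

```python
def uplift_top_n(predicted, base_rate, n):
    """Mean of (predicted - base_rate) over the n highest predicted probabilities."""
    top = sorted(predicted, reverse=True)[:n]
    if not top:
        raise ValueError("no predictions to average")
    return sum(p - base_rate for p in top) / len(top)

base_rate = 0.0928171                              # base probability from section 2.2
toy_predictions = [0.05, 0.40, 0.10, 0.25, 0.02]   # hypothetical model outputs

# Keep the 2 most promising rows; the report keeps 16476 control-set rows
result = uplift_top_n(toy_predictions, base_rate, n=2)
print(round(result, 7))
```

Because only the highest-scoring rows are kept, the estimate depends on the chosen number of calls as well as on the model.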

2.4 Summary

  1. Proposed approach shows an uplift of ~9 percentage points (0.0885)
  2. Performed actions:
    2.1. Data wrangling and clean-up
    2.2. Data interpretation and variable choice (removed contact group variables)
    2.3. Model building with cross-validation
    2.4. Best model choice based on AUC
    2.5. Uplift calculation
  3. Proposed next steps to improve the analysis (see section 3)

3. Next Steps

These are PROPOSED next steps to improve this analysis

  1. variable impact assessment, aka “opening the black box” - which X’s are “significant” and how
  2. variable cross-correlations (started in 1.2.1.3)
  3. for some contact variables, we could check the prediction’s sensitivity to them by including them in the prediction model and running a simulation, in order to check whether there is a valid business case for paying to control them
  4. alternative approach: build a model using both the campaign and control sets, with a campaign/control flag as a categorical variable. The coefficient of that variable would be a proxy for uplift.